fix(bigquery-analytics): stop exporting plugin-owned OTel spans (#94)#5
Merged
Conversation
…le#94) When Agent Engine telemetry is enabled (GOOGLE_CLOUD_AGENT_ENGINE_ENABLE_TELEMETRY=true) and Cloud Trace export is configured on the global OTel tracer provider, the BigQuery Agent Analytics plugin causes every instrumented operation to appear as TWO spans in Cloud Trace — the framework's real span plus a duplicate plugin-owned span. Reported in GoogleCloudPlatform/BigQuery-Agent-Analytics-SDK#94 (also at haiyuan-eng-google#94) by a regular BQAA + Agent Engine user. Root cause ---------- TraceManager.push_span() called tracer.start_span(...) purely as an ID carrier, so the plugin could populate BigQuery span_id / parent_span_id columns. The author was aware of one related class of pitfall (re-parenting the framework's spans — fixed by not attaching the plugin span to the ambient context, see google#4561), but that guard does not stop the plugin span from going through the configured SpanProcessor → BatchSpanProcessor → Cloud Trace exporter. Once Agent Engine wires the global provider to Cloud Trace, every push/pop pair is exported. The plugin already tracked every parent/child relationship on its own contextvar-backed _SpanRecord stack — the OTel span object was incidental to correctness. Fix (scoped to TraceManager methods + _SpanRecord) ------------------------------------------------- * _SpanRecord no longer holds an OTel span object. It carries span_id, trace_id, owns_span, start_time_ns, first_token_time. * push_span generates a 16-hex span_id directly and inherits trace_id by precedence: top of internal stack → ambient OTel span → freshly generated 32-hex. No tracer.start_span call. * attach_current_span extracts trace_id/span_id from the ambient span (the existing path the framework already supports) and stores them as plain strings — no longer holds the OTel object. * pop_span / clear_stack drop the .end()/.start_time machinery since there is no OTel span to end. * get_trace_id / get_start_time read directly from the record. The signatures of push_span / attach_current_span / pop_span / clear_stack / get_trace_id / get_start_time are unchanged. Module-level `tracer = trace.get_tracer(...)` is retained for ABI compat (currently unused by plugin code; can be removed in a follow-up if no external consumers are identified). attach_current_span() is otherwise untouched — it only observed the ambient span; it never created one. That path was already correct and remains so. Cross-system correlation ------------------------ BigQuery rows still carry trace_id, inherited from the ambient Agent Engine / Runner span when present. Joining `agent_events` to Cloud Trace by trace_id continues to work end-to-end. Tests ----- Three existing tests that were directly asserting the bug (test_otel_integration, test_otel_integration_real_provider, test_clear_stack_ends_owned_spans) are rewritten as inverted regression guards: * test_push_pop_does_not_call_tracer_start_span * test_push_pop_does_not_export_spans_through_real_provider * test_clear_stack_does_not_export_spans Each asserts that the corresponding code path does NOT export an OTel span via an InMemorySpanExporter wired to a real provider, or does NOT invoke tracer.start_span via a mock spy. Four new tests added to lock in the lifecycle / inheritance contracts the plugin must keep: * test_push_span_inherits_ambient_trace_id — when an ambient OTel span exists (the Agent Engine pattern), the plugin's trace_id matches it. * test_llm_request_response_share_span_id_contract — the paired LLM_REQUEST / LLM_RESPONSE rows share one span_id (the documented BQAA join contract). * test_tool_starting_completed_share_span_id_contract — same invariant for the tool lifecycle pair. * test_streaming_llm_response_shares_span_id_until_final_contract — multiple partial LLM_RESPONSE callbacks reuse the same span_id and only the final fire pops the span. Prevents a future "dedupe" of streaming rows from breaking the contract. 226/229 plugin tests pass (6 skipped for unrelated optional deps); pyink + isort clean. Refs: - haiyuan-eng-google/BigQuery-Agent-Analytics-SDK#94 (reported by a customer) - google#4561 (prior fix for span-hierarchy re-parenting) - google#4645 (prior fix for trace_id fracture)
9c8f27d to
a01758a
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes the duplicate-span bug Guillaume reported on haiyuan-eng-google#94 — when Agent Engine telemetry is on (
GOOGLE_CLOUD_AGENT_ENGINE_ENABLE_TELEMETRY=true), every BQAA-instrumented operation surfaces as two spans in Cloud Trace: the framework's real span plus a duplicate plugin-owned span.Root cause
TraceManager.push_span()calledtracer.start_span(...)to mint an OTel span purely as an ID carrier for BigQueryspan_id/parent_span_idcolumns. The author was aware of one related pitfall — re-parenting the framework's own spans, fixed by not attaching the plugin span to the ambient context (cf. google#4561) — but that guard does not prevent the plugin span from going through the configuredSpanProcessor → BatchSpanProcessor → Cloud Trace exporterpipeline. Once Agent Engine wires the global provider to Cloud Trace, every push/pop pair gets exported.The plugin already tracked every parent/child relationship on its own contextvar-backed
_SpanRecordstack — the OTel span object was incidental to correctness.Fix (scoped to
TraceManager+_SpanRecord)_SpanRecordspan_id,trace_id,owns_span,start_time_ns,first_token_time.push_spanspan_iddirectly. Inheritstrace_idby precedence: top of internal stack → ambient OTel span → freshly generated 32-hex. Notracer.start_span(...)call.attach_current_spantrace_id/span_idfrom the ambient span as strings (the existing pattern); no longer stores the OTel object.pop_span/clear_stack.end()/ OTelstart_timemachinery — there is no OTel span to end.get_trace_id/get_start_timeMethod signatures unchanged. The module-level
tracer = trace.get_tracer(...)is retained for ABI compat (currently unused by plugin code; can be removed in a follow-up if no external consumers are found).Cross-system correlation preserved
BigQuery rows still carry
trace_id, inherited from the ambient Agent Engine / Runner span when present. Joiningagent_eventsto Cloud Trace bytrace_idcontinues to work end-to-end. That was the actual reason the OTel span was created in the first place — and it's preserved by extracting the trace_id from the ambient context instead of starting a new span.Tests
Three existing tests directly asserted the bug (the "OTel spans are created and exported" contract). They're rewritten as inverted regression guards against re-introduction:
test_otel_integrationtest_push_pop_does_not_call_tracer_start_spantest_otel_integration_real_providertest_push_pop_does_not_export_spans_through_real_providertest_clear_stack_ends_owned_spanstest_clear_stack_does_not_export_spansEach wires a real
InMemorySpanExporter(or mock spy ontracer) and asserts the exporter sees zero spans /tracer.start_spanis not called after a full push → pop cycle.Four new tests lock in the lifecycle / inheritance contracts the plugin must keep:
test_push_span_inherits_ambient_trace_idtrace_idmatches it.test_llm_request_response_share_span_id_contractspan_id— the documented BQAA join contract.test_tool_starting_completed_share_span_id_contracttest_streaming_llm_response_shares_span_id_until_final_contractspan_id; only the final fire pops. Prevents a future "dedupe" of streaming rows from breaking the contract.226/229 plugin tests pass (6 skipped for unrelated optional deps; 0 fail).
pyink+isortclean.What's deliberately NOT in this PR
tracersymbol. Kept for ABI compat; can be cleaned up in a follow-up after a downstream-consumer scan.google/adk-python). PR lands on the fork first; cherry-pick upstream after review.Diff stat
Refs
Test plan
pytest tests/unittests/plugins/test_bigquery_agent_analytics_plugin.py -q— 226 pass / 6 skip / 0 failpyink --config pyproject.toml --check— clean